115 research outputs found

    Computational methods for the discovery and analysis of genes and other functional DNA sequences

    The need for automating genome analysis is a result of the tremendous amount of genomic data. As of today, a high-throughput DNA sequencing machine can run millions of sequencing reactions in parallel, and it is becoming faster and cheaper to sequence the entire genome of an organism. Public databases containing genomic data are growing exponentially, hence the rise in demand for intuitive automated methods of DNA analysis and subsequent gene identification. However, the complexity of gene organization makes automation a challenging task, and smart algorithm design and parallelization are necessary to perform accurate analyses in reasonable amounts of time. This work describes two such automated methods for the identification of novel genes within given DNA sequences. The first method utilizes negative selection patterns as an evolutionary rationale for the identification of additional members of a gene family. As input, it requires a known protein-coding gene in that family. The second method is a massively parallel data mining algorithm that searches a whole genome for inverted repeats (palindromic sequences) and identifies potential precursors of non-coding RNA genes. Both methods were validated successfully on the fully sequenced and well-studied plant species Arabidopsis thaliana. --Abstract, page iv

    A quantitative study of gene identification techniques based on evolutionary rationales

    Current gene identification (GI) techniques typically rely on matching biological or chemical properties of specific genes, specific species, specific ecotypes, etc. In this thesis, a new automated GI technique is proposed and compared against another computer-based technique proposed earlier. Both methods utilize EST data available from NCBI databases to discover previously unknown genes. The newly proposed method identifies one gene family at a time and is based on a distinctive negative selection pattern (NSP) of differences, which is seen between the coding regions of gene family members. The other technique, called ESTminer, attempts genome-wide gene family identification for any organism by detecting single nucleotide polymorphisms between potential family members. In this thesis, a complete automated analysis of both techniques is presented. --Abstract, page iii

    Protein Secondary Structure Prediction using Parallelized Rule Induction from Coverings

    Protein 3D structure prediction has always been an important research area in bioinformatics. In particular, the prediction of secondary structure has been a well-studied research topic. Despite the recent breakthrough of combining multiple sequence alignment information and artificial intelligence algorithms to predict protein secondary structure, the Q3 accuracy of various computational prediction algorithms has rarely exceeded 75%. In a previous paper [1], this research team presented a rule-based method called RT-RICO (Relaxed Threshold Rule Induction from Coverings) to predict protein secondary structure. The average Q3 accuracy on the sample datasets using RT-RICO was 80.3%, an improvement over comparable computational methods. Although this demonstrated that RT-RICO might be a promising approach for predicting secondary structure, the algorithm's computational complexity and program running time limited its use. Herein a parallelized implementation of a slightly modified RT-RICO approach is presented. This new version of the algorithm facilitated the testing of a much larger dataset of 396 protein domains [2]. Parallelized RT-RICO achieved a Q3 score of 74.6%, which is higher than the consensus prediction accuracy of 72.9% that was achieved for the same test dataset by a combination of four secondary structure prediction methods [2].
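
    The Q3 score quoted above is the percentage of residues whose three-state label (helix H, strand E, coil C) is predicted correctly. A minimal Python sketch of that calculation follows; the function name and the two example strings are hypothetical and are not taken from the RT-RICO paper.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Return the percentage of residues whose three-state label matches.

    Both arguments are strings over the alphabet H (helix), E (strand)
    and C (coil), one character per residue.
    """
    if len(predicted) != len(observed):
        raise ValueError("sequences must have equal length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    # Count positions where the predicted state equals the observed state.
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Hypothetical 16-residue example with two mispredicted residues.
observed = "CCHHHHHHCCEEEECC"
predicted = "CCHHHHHCCCEEECCC"
print(f"Q3 = {q3_accuracy(predicted, observed):.1f}%")
```

    Published Q3 figures are usually averaged over a whole test set, so a per-protein value like this one would be one term in that average.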

    Transcriptomic responses of the heart and brain to anoxia in the Western Painted turtle

    Painted turtles are the most anoxia-tolerant tetrapods known, capable of surviving without oxygen for more than four months at 3°C and 30 hours at 20°C. To investigate the transcriptomic basis of this ability, we used RNA-seq to quantify mRNA expression in the painted turtle ventricle and telencephalon after 24 hours of anoxia at 19°C. Reads were obtained from 22,174 different genes, 13,236 of which were compared statistically between treatments for each tissue. Total tissue RNA contents decreased by 16% in telencephalon and 53% in ventricle. The telencephalon and ventricle showed ≥ 2x expression (increased expression) in 19 and 23 genes, respectively, while only four genes in ventricle showed ≤ 0.5x changes (decreased expression). When treatment effects were compared between anoxic and normoxic conditions in the two tissue types, 31 genes were increased (≥ 2x change) and 2 were decreased (≤ 0.5x change). Most of the affected genes were immediate early genes and transcription factors that regulate cellular growth and development; changes that would seem to promote transcriptional, translational, and metabolic arrest. No genes related to ion channels, synaptic transmission, cardiac contractility or excitation-contraction coupling changed. The generalized expression pattern in telencephalon and across tissues, but not in ventricle, correlated with the predicted metabolic cost of transcription, with the shortest genes and those with the fewest exons showing the largest increases in expression.
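
    The ≥ 2x and ≤ 0.5x cut-offs used above can be illustrated with a short Python sketch. It shows only the thresholding step; the study also compared genes statistically, which is not reproduced here. The function name and the expression values are hypothetical.

```python
def classify_fold_change(anoxic: float, normoxic: float) -> str:
    """Label a gene by the ratio of anoxic to normoxic expression.

    Thresholds follow the abstract: >= 2x is "increased",
    <= 0.5x is "decreased", anything in between is "unchanged".
    """
    if normoxic <= 0:
        raise ValueError("normoxic expression must be positive")
    ratio = anoxic / normoxic
    if ratio >= 2.0:
        return "increased"
    if ratio <= 0.5:
        return "decreased"
    return "unchanged"


# Hypothetical expression values (arbitrary units) for three made-up genes.
examples = {"geneA": (40.0, 10.0), "geneB": (5.0, 10.0), "geneC": (12.0, 10.0)}
for name, (anoxic, normoxic) in examples.items():
    print(name, classify_fold_change(anoxic, normoxic))
```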

    Identification of a pan-cancer oncogenic microRNA superfamily anchored by a central core seed motif

    MicroRNAs modulate tumorigenesis through suppression of specific genes. As many tumour types rely on overlapping oncogenic pathways, a core set of microRNAs may exist, which consistently drives or suppresses tumorigenesis in many cancer types. Here we integrate The Cancer Genome Atlas (TCGA) pan-cancer data set with a microRNA target atlas composed of publicly available Argonaute Crosslinking Immunoprecipitation (AGO-CLIP) data to identify pan-tumour microRNA drivers of cancer. Through this analysis, we show a pan-cancer, coregulated oncogenic microRNA ‘superfamily’ consisting of the miR-17, miR-19, miR-130, miR-93, miR-18, miR-455 and miR-210 seed families, which cotargets critical tumour suppressors via a central GUGC core motif. We subsequently define mutations in microRNA target sites using the AGO-CLIP microRNA target atlas and TCGA exome-sequencing data. These combined analyses identify pan-cancer oncogenic cotargeting of the phosphoinositide 3-kinase, TGFβ and p53 pathways by the miR-17-19-130 superfamily members

    MuSiC: Identifying mutational significance in cancer genomes

    Massively parallel sequencing technology and the associated rapidly decreasing sequencing costs have enabled systemic analyses of somatic mutations in large cohorts of cancer cases. Here we introduce a comprehensive mutational analysis pipeline that uses standardized sequence-based inputs along with multiple types of clinical data to establish correlations among mutation sites, affected genes and pathways, and to ultimately separate the commonly abundant passenger mutations from the truly significant events. In other words, we aim to determine the Mutational Significance in Cancer (MuSiC) for these large data sets. The integration of analytical operations in the MuSiC framework is widely applicable to a broad set of tumor types and offers the benefits of automation as well as standardization. Herein, we describe the computational structure and statistical underpinnings of the MuSiC pipeline and demonstrate its performance using 316 ovarian cancer samples from the TCGA ovarian cancer project. MuSiC correctly confirms many expected results, and identifies several potentially novel avenues for discovery

    Somatic neurofibromatosis type 1 (NF1) inactivation characterizes NF1-associated pilocytic astrocytoma

    Low-grade brain tumors (pilocytic astrocytomas) arising in the neurofibromatosis type 1 (NF1) inherited cancer predisposition syndrome are hypothesized to result from a combination of germline and acquired somatic NF1 tumor suppressor gene mutations. However, genetically engineered mice (GEM) in which mono-allelic germline Nf1 gene loss is coupled with bi-allelic somatic (glial progenitor cell) Nf1 gene inactivation develop brain tumors that do not fully recapitulate the neuropathological features of the human condition. These observations raise the intriguing possibility that, while loss of neurofibromin function is necessary for NF1-associated low-grade astrocytoma development, additional genetic changes may be required for full penetrance of the human brain tumor phenotype. To identify these potential cooperating genetic mutations, we performed whole-genome sequencing (WGS) analysis of three NF1-associated pilocytic astrocytoma (PA) tumors. We found that the mechanism of somatic NF1 loss was different in each tumor (frameshift mutation, loss of heterozygosity, and methylation). In addition, tumor purity analysis revealed that these tumors had a high proportion of stromal cells, such that only 50%–60% of cells in the tumor mass exhibited somatic NF1 loss. Importantly, we identified no additional recurrent pathogenic somatic mutations, supporting a model in which neuroglial progenitor cell NF1 loss is likely sufficient for PA formation in cooperation with a proper stromal environment

    A framework for automated enrichment of functionally significant inverted repeats in whole genomes

    Background: RNA transcripts from genomic sequences showing dyad symmetry typically adopt hairpin-like, cloverleaf, or similar structures that act as recognition sites for proteins. Such structures are often the precursors of non-coding RNA (ncRNA) sequences such as microRNA (miRNA) and small-interfering RNA (siRNA), which have recently garnered more functional significance than in the past. Genomic DNA contains hundreds of thousands of such inverted repeats (IRs) with varying degrees of symmetry. However, by collecting statistically significant information from a known set of ncRNA, we can sort these IRs into those that are likely to be functional.
    Results: A novel method was developed to scan genomic DNA for partially symmetric inverted repeats, and the resulting set was further refined to match miRNA precursors (pre-miRNA) with respect to their density of symmetry, statistical probability of the symmetry, length of stems in the predicted hairpin secondary structure, and the GC content of the stems. This method was applied to the Arabidopsis thaliana genome and validated against the set of 190 known Arabidopsis pre-miRNA in the miRBase database. A preliminary scan for IRs identified 186 of the known pre-miRNA among 714700 pre-miRNA candidates. This large number of IRs was further refined to 483908 candidates with 183 pre-miRNA identified, and further still to 165371 candidates with 171 pre-miRNA identified (i.e. with 90% of the known pre-miRNA retained).
    Conclusions: A set of 165371 candidates for potentially functional miRNA is still too large to warrant wet lab analyses, such as northern blotting, on all of them. Hence additional filters are needed to further reduce the number of candidates while still retaining most of the known miRNA. These include detection of promoters and terminators, homology analyses, location of the candidate relative to coding regions, and better secondary structure prediction algorithms. The software developed is designed to easily accommodate such additional filters with minimal experience in Perl.
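
    The idea of scanning a sequence for inverted repeats and then filtering on stem GC content can be sketched in a few lines of Python. This is a simplified illustration, not the authors' Perl software: it finds only perfectly complementary stems of a fixed length, whereas the method described above handles partially symmetric repeats. All function names and parameter defaults are assumptions made for this example.

```python
# Translation table mapping each DNA base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in a non-empty DNA string."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def find_inverted_repeats(seq, stem_len=6, min_loop=3, max_loop=20):
    """Return (left_start, right_start, stem) for perfect inverted repeats.

    A hit is a stem of length stem_len whose reverse complement occurs
    again after a loop of min_loop to max_loop bases.
    """
    seq = seq.upper()
    hits = []
    # Last left-stem start that still leaves room for loop and right stem.
    for i in range(len(seq) - 2 * stem_len - min_loop + 1):
        stem = seq[i:i + stem_len]
        target = reverse_complement(stem)
        for loop in range(min_loop, max_loop + 1):
            j = i + stem_len + loop
            if j + stem_len > len(seq):
                break
            if seq[j:j + stem_len] == target:
                hits.append((i, j, stem))
    return hits


# Hypothetical 20-base sequence: stem GCATGA, 4-base loop, then TCATGC.
example = "AAGCATGATTTTTCATGCAA"
for left, right, stem in find_inverted_repeats(example):
    print(left, right, stem, f"GC={gc_content(stem):.2f}")
```

    A filter such as a minimum stem GC content could then be applied to the returned list, in the spirit of the refinement steps listed under Results.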